 reference measure


Non-convex entropic mean-field optimization via Best Response flow

Neural Information Processing Systems

We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the $L^1$-Wasserstein distance, which ensures the existence of its unique fixed point that is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.
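As a rough illustration of the fixed-point idea in the abstract above, the following is a minimal sketch on a finite state space, not the paper's construction. It assumes a toy quadratic energy F(mu) = 0.5 * mu^T A mu with a symmetric, typically indefinite matrix `A` (so F is generally non-convex), a uniform reference measure `ref`, and an explicit Euler discretization of the flow. The names `best_response`, `best_response_flow`, `A`, `ref`, `sigma` and `step` are all hypothetical. With a large enough regularization parameter `sigma` the Gibbs-type best response map is a contraction in this toy setting (here in the Euclidean norm, not the $L^1$-Wasserstein distance used in the paper), so the iteration settles on a fixed point.

```python
import numpy as np

def best_response(mu, A, ref, sigma):
    """Gibbs-type best response on a finite state space.

    Assumes the toy energy F(mu) = 0.5 * mu @ A @ mu with symmetric A,
    whose flat derivative at mu is the vector A @ mu.
    """
    logits = np.log(ref) - (A @ mu) / sigma
    logits -= logits.max()  # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

def best_response_flow(mu0, A, ref, sigma, step=0.5, n_steps=200):
    """Explicit Euler discretization of d(mu)/dt = best_response(mu) - mu."""
    mu = mu0.copy()
    for _ in range(n_steps):
        mu = mu + step * (best_response(mu, A, ref, sigma) - mu)
    return mu

rng = np.random.default_rng(0)
n = 5
B = rng.uniform(-1.0, 1.0, size=(n, n))
A = 0.5 * (B + B.T)        # symmetric, typically indefinite
ref = np.full(n, 1.0 / n)  # uniform reference measure (assumption)
sigma = 10.0               # large regularization, chosen so the map contracts

mu_star = best_response_flow(ref, A, ref, sigma)
# Distance between mu_star and its own best response; near zero at a fixed point.
residual = np.abs(best_response(mu_star, A, ref, sigma) - mu_star).max()
print(mu_star, residual)
```

Lowering `sigma` relative to the size of `A` weakens the contraction, which mirrors the trade-off between non-convexity and regularization described in the abstract.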


Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms

arXiv.org Machine Learning

In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used as the reference measure by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
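The chaining idea can be checked numerically on a finite set of models. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it uses synthetic per-sample losses, covers only a single forward pass through the clients, and assumes the scaling `lam_k = lam * n_total / n_k`, which is what makes the product of local Gibbs factors equal the centralized one in this finite setting. The names `gibbs`, `sizes`, `losses`, `prior` and `lam` are hypothetical.

```python
import numpy as np

def gibbs(ref, risk, lam):
    """Minimiser of E_P[risk] + lam * KL(P || ref) over a finite model set."""
    logits = np.log(ref) - risk / lam
    logits -= logits.max()  # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(1)
n_models = 6
sizes = np.array([20, 50, 30])  # hypothetical local sample sizes
# Synthetic per-sample losses: losses[k] has shape (n_k, n_models).
losses = [rng.uniform(0.0, 1.0, size=(n_k, n_models)) for n_k in sizes]
prior = np.full(n_models, 1.0 / n_models)
lam = 0.1
n_total = sizes.sum()

# Centralized ERM-RER: empirical risk over the pooled data.
central_risk = np.concatenate(losses, axis=0).mean(axis=0)
p_central = gibbs(prior, central_risk, lam)

# Decentralized chain: client k+1 uses client k's Gibbs measure as reference.
p_chain = prior
for n_k, loss_k in zip(sizes, losses):
    lam_k = lam * n_total / n_k  # assumed scaling with local sample size
    p_chain = gibbs(p_chain, loss_k.mean(axis=0), lam_k)

print(np.allclose(p_chain, p_central))
```

Only the measures over models pass between clients here. The local loss arrays are never pooled except to compute the centralized baseline used for comparison.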



Alpha Divergence Losses for Biometric Verification

arXiv.org Artificial Intelligence

Performance in face and speaker verification is largely driven by margin-based softmax losses such as CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly due to their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find that this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a training instability in A3M, caused by sparsity, with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is critical for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount. Finally, the sparsity of $α$-divergence-based posteriors enables memory-efficient training, which is crucial for datasets with millions of identities.



Entropic optimal transport beyond product reference couplings: the Gaussian case on Euclidean space

arXiv.org Machine Learning

The optimal transport problem with squared Euclidean cost consists in finding a coupling between two input measures that maximizes correlation. Consequently, the optimal coupling is often singular with respect to Lebesgue measure. Regularizing the optimal transport problem with an entropy term yields an approximation called entropic optimal transport. Entropic penalties steer the induced coupling toward a reference measure with desired properties. For instance, when seeking a diffuse coupling, the most popular reference measures are the Lebesgue measure and the product of the two input measures. In this work, we study the case where the reference coupling is not necessarily assumed to be a product. We focus on the Gaussian case as a motivating paradigm, and provide a reduction of this more general optimal transport criterion to a matrix optimization problem. This reduction enables us to provide a complete description of the solution, both in terms of the primal variable and the dual variables. We argue that flexibility in terms of the reference measure can be important in statistical contexts, for instance when one has prior information, when there is uncertainty regarding the measures to be coupled, or to reduce bias when the entropic problem is used to estimate the un-regularized transport problem. In particular, we show in numerical examples that choosing a suitable reference plan makes it possible to reduce the bias caused by the entropic penalty.
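To make the role of the reference coupling concrete, here is a minimal discrete sketch rather than the Gaussian analysis of the paper. It assumes the standard fact that, for finite measures, the minimizer of <C, pi> + eps * KL(pi || R) over couplings with marginals `a` and `b` has the form diag(u) (R * exp(-C / eps)) diag(v), so Sinkhorn scaling can be run on the kernel `R * exp(-C / eps)`. The function name, the grid, and the non-product reference `R_corr` are hypothetical. `R_corr` is built from the cost itself purely for illustration.

```python
import numpy as np

def sinkhorn_with_reference(a, b, C, R, eps, n_iter=1000):
    """Entropic OT with penalty eps * KL(pi || R), solved by Sinkhorn scaling."""
    K = R * np.exp(-C / eps)  # Gibbs kernel tilted by the reference coupling
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)       # match row marginals
        v = b / (K.T @ u)     # match column marginals
    return u[:, None] * K * v[None, :]

x = np.linspace(0.0, 1.0, 8)
y = np.linspace(0.0, 1.0, 8)
C = (x[:, None] - y[None, :]) ** 2  # squared Euclidean cost
a = np.full(8, 1.0 / 8)
b = np.full(8, 1.0 / 8)
eps = 0.5

R_product = np.outer(a, b)          # product reference coupling
R_corr = np.exp(-C / 0.2)           # hypothetical correlated reference
R_corr /= R_corr.sum()

pi_prod = sinkhorn_with_reference(a, b, C, R_product, eps)
pi_corr = sinkhorn_with_reference(a, b, C, R_corr, eps)

cost_prod = (C * pi_prod).sum()
cost_corr = (C * pi_corr).sum()
print(cost_prod, cost_corr)
```

In this toy problem the unregularized optimal cost is zero (the identity coupling), so the transport cost of each entropic solution measures its bias. The correlated reference gives the smaller value, because it is equivalent here to lowering the effective regularization.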


Linearized Optimal Transport pyLOT Library: A Toolkit for Machine Learning on Point Clouds

arXiv.org Machine Learning

Instead, point clouds or continuous probability measures are the appropriate data structures. These data arise naturally in fields such as computer vision, image processing, shape analysis, and generative modeling, where representing complex objects as probability distributions provides a richer and more flexible framework for analysis. Real-world examples include text documents with bag-of-words models treating word counts as features, which forms a histogram for each document [35], imaging data where pixel intensity is interpreted as mass [26] and results in 2D discrete probability measures over the image grid, and gene expression data that is interpreted as a distribution across a gene network [8, 15]. Optimal transport (OT) theory [30] has recently emerged as a powerful tool to compare probability measures. Qualitatively, OT generates a distance metric between probability measures by minimizing the work needed to move one distribution into another over all transport plans. It has gained significant popularity for applications [4, 26, 27] involving point clouds and probability distributions. OT allows for the computation of distances between distributions by solving a minimization problem over transportation plans. Despite its theoretical elegance and its ability to capture geometric properties of distributions, using vanilla OT is computationally expensive and does not directly integrate into existing machine learning pipelines. For this reason, OT has been somewhat limited in practical applications, particularly in settings that demand scalable and efficient algorithms for tasks such as classification, dimension reduction, and generation.


Synthesis and Analysis of Data as Probability Measures with Entropy-Regularized Optimal Transport

arXiv.org Machine Learning

We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of $m$ reference measures given a set of coefficients belonging to the $m$-dimensional simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure $\mu$. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of regularized barycenters as solutions to a fixed-point equation for the average of the entropic maps from the barycenter to the reference measures. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when $\mu$ is a barycenter. It is shown that these coordinates, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, a hallmark of entropy-regularized optimal transport, and we verify these rates experimentally. We also establish that barycentric coordinates are stable with respect to perturbations in the Wasserstein-2 metric, suggesting a robustness of these coefficients to corruptions. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.


Linearized Wasserstein Barycenters: Synthesis, Analysis, Representational Capacity, and Applications

arXiv.org Machine Learning

We propose the linear barycentric coding model (LBCM) that utilizes the linear optimal transport (LOT) metric for analysis and synthesis of probability measures. We provide a closed-form solution to the variational problem characterizing the probability measures in the LBCM and establish equivalence of the LBCM to the set of Wasserstein-2 barycenters in the special case of compatible measures. Computational methods for synthesizing and analyzing measures in the LBCM are developed with finite sample guarantees. One of our main theoretical contributions is to identify an LBCM, expressed in terms of a simple family, which is sufficient to express all probability measures on the interval $[0,1]$. We show that a natural analogous construction of an LBCM in $\mathbb{R}^2$ fails, and we leave it as an open problem to identify the proper extension in more than one dimension. We conclude by demonstrating the utility of LBCM for covariance estimation and data imputation.


Learning signals defined on graphs with optimal transport and Gaussian process regression

arXiv.org Machine Learning

In computational physics, machine learning has now emerged as a powerful complementary tool to explore efficiently candidate designs in engineering studies. Outputs in such supervised problems are signals defined on meshes, and a natural question is the extension of general scalar output regression models to such complex outputs. Changes between input geometries in terms of both size and adjacency structure in particular make this transition non-trivial. In this work, we propose an innovative strategy for Gaussian …

Due to the associated computational cost, machine learning (ML) is a natural candidate to accelerate such design exploration: starting from an initial database of FEM simulations, a supervised model is trained to predict the FEM outputs from its inputs and is ultimately used as a proxy to evaluate new geometries with a negligible cost. But in this context, the supervised learning task actually involves inputs given as meshes, which can be modeled as graphs with continuous node attributes, different number of nodes and edges. In addition, the outputs can be scalar values but also physical quantities of interest defined on each node of the input graph, which we refer to as signals defined on graphs or fields.